This notebook performs Exploratory Data Analysis on BERA's backfilled data, narrowed to WB's own brands. It analyzes and visualizes the four KPIs, the five marketing levers, and the brand-personality imagery for each of our brands, and records my findings along with some business insights and recommendations. Further analysis can be conducted after we join with the Neustar and MA data, and once we have defined our goal and the business questions we want to answer.
BERA is a brand management company that provides brand tracking in real time. It collects brand data through surveys of roughly 1M participants per year, covering 4,000 brands across 200 sectors in 87 markets globally, matched to census data. Participants are invited to complete the survey on a mobile device or desktop browser for a cash incentive. The sample is population-balanced and market-defined, with a carefully structured volume and frequency system.
The survey's syndicated metrics and measures include 4 KPIs, the 5P marketing levers, and 5 imagery categories with 42 adjectives.
The 4 KPIs are Familiarity, Meaningfulness, Regard, and Uniqueness. They are straightforward to interpret: how familiar you are with the brand, how highly you regard it, how meaningful/relevant it is to you, and how competitively unique it is. The 5P marketing levers are Product (provides a product or service that meets my needs), Price (willingness to pay a premium for it), Promotion (has advertising communications that are meaningful to me), Place (available when and where I want it, i.e. distribution), and People (has great people: service, ambassadors, users). Both the 4 KPIs and the 5Ps are scored on a 1-7 scale. For the imagery categories, participants were asked to select the adjectives they associate with each brand; a selected adjective is marked 1, otherwise 0.
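These scale conventions can be checked directly with a couple of assertions. A minimal sketch on a toy frame (the column names mirror the BERA data; the values are made up):

```python
import pandas as pd

# Toy frame mirroring the BERA column conventions; the values are made up.
toy = pd.DataFrame({
    "FAMILIARITY": [7, 4, 1],   # KPI, 1-7 scale
    "PRICE": [2, 5, 7],         # marketing lever, 1-7 scale
    "FRIENDLY": [1, 0, 1],      # imagery adjective, 1 = selected
})

# KPI and 5P columns should stay within the 1-7 scale...
assert toy["FAMILIARITY"].between(1, 7).all()
assert toy["PRICE"].between(1, 7).all()
# ...and imagery adjectives should be binary indicators.
assert toy["FRIENDLY"].isin([0, 1]).all()
```

Running the same checks on `bera` after loading would catch out-of-range or non-binary values early.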
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import seaborn as sns
from plotly.subplots import make_subplots
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image  # PIL's Image is the one used below (word-cloud mask)
from sklearn.cluster import KMeans
from sklearn import preprocessing
import squarify
plt.style.use('ggplot')
sns.set_theme(style="ticks", color_codes=True)
bera = pd.read_csv("bera_backfilled.csv")
bera.head()
bera.shape
#check features and NA
bera.info()
We notice missing values in CVS, LGBT, NPS, PEOPLE, PLACEMENT, PRICE, PRODUCT, and PROMOTION. Since those attributes are not critical for now, we leave them as-is at this stage.
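Before deciding to ignore the missingness, it is worth quantifying it: `isna().mean()` gives the per-column share of NAs. A minimal sketch on a toy frame standing in for `bera` (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for `bera` (the real CSV is not read here).
demo = pd.DataFrame({
    "PRICE": [1.0, np.nan, 5.0, 7.0],
    "PROMOTION": [np.nan, np.nan, 3.0, 6.0],
    "BRAND_LABEL": ["Batman", "Superman", "Batman", "Superman"],
})

# Share of missing values per column, largest first.
na_share = demo.isna().mean().sort_values(ascending=False)
print(na_share)
```

On the real data, `bera.isna().mean().sort_values(ascending=False)` gives the same summary in one line.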
bera.AUDIENCE.value_counts()
bera.CATEGORY_NAME.value_counts()
bera.DATE_FILTER.value_counts()
bera.FILTER.value_counts()
bera.MONTH.value_counts()
Remove the columns AUDIENCE, CATEGORY_NAME, and CATEGORY_ID, since each holds a single identical value, and drop the metadata column METADATA_FILENAME as well.
bera.drop(['AUDIENCE','CATEGORY_ID','CATEGORY_NAME', "METADATA_FILENAME"],
inplace=True, axis=1, errors='ignore')
What date range does the data span?
print(bera.RECORD_DATE.min())
print(bera.RECORD_DATE.max())
bera.RECORD_DATE.value_counts()
The BERA March file covers records from 2020-07-13 to 2021-02-01.
How many distinct WB brands are there?
bera.BRAND_LABEL.nunique()
The Flintstones (characters) and Flintstones refer to the same brand, so let's replace Flintstones with The Flintstones (characters).
bera['BRAND_LABEL'] = bera['BRAND_LABEL'].replace(['Flintstones'], 'The Flintstones (characters)')
#check brands distribution
brands = bera.BRAND_LABEL.value_counts().to_frame().reset_index()
brands.columns = ['brand', 'frequency']
brands
brands["studio"] = "WB"
brands["category"] = ['Superhero', 'Animation', 'Superhero', 'Animation', 'Other Franchise',
'Superhero', 'Superhero', 'Superhero', 'Animation', 'Other Franchise']
fig = px.treemap(brands,
path=['studio', 'category', 'brand'],
values='frequency')
fig.update_layout(title="WB Brands Composition",
width=1000, height=700,)
fig.show()
#plot brands distribution bar chart
fig = px.bar(brands, x="brand", y="frequency", title="Brands Distribution")
fig.update_layout(width=800, height=500)
fig.show()
From the distribution, we notice that the frequencies of the top 7 brands are approximately even, ranging from 2718 to 2769, while the last one, The Conjuring Series, has only 1547 records.
plt.style.use('ggplot')
plt.figure(figsize=(15, 6))
plt.hist([bera.FAMILIARITY,bera.REGARD, bera.MEANINGFULNESS, bera.UNIQUENESS],
label=['FAMILIARITY', 'REGARD', 'MEANINGFULNESS', 'UNIQUENESS'], alpha = 0.8)
plt.legend(loc='upper left')
plt.suptitle("KPI Distribution", size=20)
plt.show()
We notice a significantly high frequency of score 7 for Familiarity, which indicates our WB brands overall are very well known. I believe this is the biggest strength of our WB brands.
fig, axs = plt.subplots(2, 2, figsize=(12,8))
axs[0,0].hist(bera.FAMILIARITY)
axs[0,0].set_title("Familarity Distribution")
axs[0,1].hist(bera.REGARD)
axs[0,1].set_title("Regard Distribution")
axs[1,0].hist(bera.MEANINGFULNESS)
axs[1,0].set_title("Meaningfulness Distribution")
axs[1,1].hist(bera.UNIQUENESS)
axs[1,1].set_title("Uniqueness Distribution")
Each of the KPI distributions is approximately left-skewed, so the median is higher than the mean. This indicates our brands are loved by most people. One point worth attention, however, is Meaningfulness: unlike the other three KPIs, which peak at score 7, it peaks at scores 4 and 5. Because Uniqueness and Meaningfulness drive the success of our business, I believe this is an area where our brands can improve.
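The median-above-mean property of left-skewed data can be verified numerically. A small sketch on synthetic 1-7 scores with a long left tail (not the real survey data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic left-skewed 1-7 scores: most responses pile up near the top,
# with a long tail toward the low end.
scores = pd.Series(np.clip(np.round(8 - rng.exponential(scale=1.5, size=10_000)), 1, 7))

# For a left-skewed distribution the median sits above the mean
# and the skewness statistic is negative.
print(f"mean={scores.mean():.2f}, median={scores.median():.2f}, skew={scores.skew():.2f}")
```

Computing `bera.FAMILIARITY.skew()` and friends would confirm the visual impression from the histograms.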
kpis = bera[["FAMILIARITY", "REGARD", "MEANINGFULNESS", "UNIQUENESS"]]
kpis
sns.set(rc={'figure.figsize':(11.7,8.27)})
mask = np.triu(np.ones_like(kpis.corr(), dtype=bool))  # np.bool is removed in modern NumPy
heatmap = sns.heatmap(kpis.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('4 KPIs Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
The 4 KPIs are all strongly positively correlated with each other, with Regard and Meaningfulness showing the strongest correlation. We can interpret this as: if people find a brand meaningful to them, they are more likely to hold it in high regard as well. Familiarity, however, is the least correlated with the other three KPIs. That is, people who are familiar with a brand are somewhat more likely to find it unique and meaningful and to regard it highly, but the relationship is weaker than among the other KPIs.
Product - provides a product or service that meets my needs
Price - willing to pay a premium for
Promotion - has advertising communications that are meaningful to me
Place - available when and where I want it (distribution)
People - has great people (service, ambassadors, users)
fig, axs = plt.subplots(2, 3, figsize=(12,8))
axs[0,0].hist(bera.PRODUCT, color='steelblue')
axs[0,0].set_title("Product Distribution")
axs[0,1].hist(bera.PLACEMENT, color='steelblue')
axs[0,1].set_title("Placement Distribution")
axs[0,2].hist(bera.PEOPLE, color='steelblue')
axs[0,2].set_title("People Distribution")
axs[1,0].hist(bera.PRICE, color='brown')
axs[1,0].set_title("Price Distribution")
axs[1,1].hist(bera.PROMOTION, color='brown')
axs[1,1].set_title("Promotion Distribution")
fig.delaxes(axs[1][2])
Price represents people's willingness to pay a premium for our brands, yet the most common response there is a score of 1. Willingness to pay can depend on many factors: the state of the economy, how trendy/in-season a product is, a consumer's personal price points, circumstantial needs, the rareness of a product, its quality, and so on. We should dig further into this after joining with the household information.
In terms of Promotion, the counts concentrate at the two extremes, which suggests our advertising communications are meaningful to some people while simply not of interest to others.
p5s = bera[["PRICE", "PROMOTION", "PRODUCT", "PLACEMENT", "PEOPLE"]]
p5s
sns.set(rc={'figure.figsize':(11.7,8.27)})
mask = np.triu(np.ones_like(p5s.corr(), dtype=bool))  # np.bool is removed in modern NumPy
heatmap = sns.heatmap(p5s.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('5Ps Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
All 5Ps are highly correlated with each other, especially Price and Promotion. We can conclude that people's willingness to pay depends heavily on whether they find the brand's advertising communications meaningful to them. So let's dig deeper to find out which groups of people find our brands' advertising communications meaningful, so we can focus future marketing campaigns on them.
sns.set(rc={'figure.figsize':(11.7,8.27)})
demo_corr = bera[['PROMOTION', 'INCOME', 'DEMO_AGE', 'EDUCATION', 'HH_CHILDREN', 'GENDER', 'REGION_ROLLUP']].corr()
mask = np.triu(np.ones_like(demo_corr, dtype=bool))  # mask sized to this matrix, not the KPI one
heatmap = sns.heatmap(demo_corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Promotion & Demographic Correlation Heatmap', fontdict={'fontsize':18}, pad=12);
From the correlations, we find that income and number of children have a moderate positive correlation with Promotion, age and gender have a moderate negative correlation with it, while education and region stay roughly neutral. So let's look at how income, age, gender, and number of children are distributed for both the highest and the lowest Promotion scores.
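One caveat here: GENDER appears to be numerically coded (the tick labels below map 1/2 to Male/Female), but Pearson correlation on category codes is only directly interpretable for binary or ordered encodings, and string-typed columns would silently drop out of `.corr()` altogether. A toy sketch of encoding a string column before correlating (column names mirror the data; the values are made up):

```python
import pandas as pd

# Toy frame mirroring the column names; the values are made up.
demo = pd.DataFrame({
    "PROMOTION": [1, 7, 6, 2, 7, 1],
    "GENDER": ["M", "F", "F", "M", "F", "M"],
})

# Encode the string column first; .corr() would otherwise ignore it entirely.
# For a binary variable the sign of r is interpretable; for multi-level
# categories like REGION_ROLLUP, Pearson r on arbitrary codes is not.
demo["GENDER_CODE"] = pd.factorize(demo["GENDER"])[0]
corr = demo[["PROMOTION", "GENDER_CODE"]].corr()
print(corr)
```

For REGION_ROLLUP, a grouped comparison of mean PROMOTION per region would be a safer summary than a single correlation coefficient.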
low_promo = bera.loc[bera['PROMOTION'] == 1]
high_promo = bera.loc[bera['PROMOTION'] == 7]
#get the percentage of each income level for both low and high promotion
low_perc = low_promo.INCOME.value_counts()/low_promo.INCOME.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['income', 'no_meaningful_promotion']
high_perc = high_promo.INCOME.value_counts()/high_promo.INCOME.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['income', 'has_meaningful_promotion']
#join low and high promotion together
promo_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'income', right_on = 'income')
promo_perc = promo_perc.sort_values("income")
promo_perc.plot(x="income", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Income Distribution")
positions = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
labels = ["Less than $10,000", "$10,000 to $14,999", "$15,000 to $24,999", "$25,000 to $34,999",
"$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999", "$100,000 to $149,999",
"$150,000 to $199,999", "$200,000 to $499,999", "$500,000 to $1 million", "$1 million or more"]
plt.xticks(positions, labels, rotation = 30)
plt.show()
#get the percentage of each age level for both low and high promotion
low_perc = low_promo.DEMO_AGE.value_counts()/low_promo.DEMO_AGE.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['age', 'no_meaningful_promotion']
high_perc = high_promo.DEMO_AGE.value_counts()/high_promo.DEMO_AGE.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['age', 'has_meaningful_promotion']
#join low and high promotion together
promo_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'age', right_on = 'age')
promo_perc = promo_perc.sort_values("age")
promo_perc.plot(x="age", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Age Distribution")
positions = (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
labels = ["18-21", "22-24", "25-29", "30-34", "35-39", "40-44",
"45-49", "50-54", "55-59", "60-64", "65 or Older"]
plt.xticks(positions, labels, rotation = 30)
plt.show()
#get the percentage of each number of children for both low and high promotion
low_perc = low_promo.HH_CHILDREN.value_counts()/low_promo.HH_CHILDREN.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['child_no', 'no_meaningful_promotion']
high_perc = high_promo.HH_CHILDREN.value_counts()/high_promo.HH_CHILDREN.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['child_no', 'has_meaningful_promotion']
#join low and high promotion together
promo_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'child_no', right_on = 'child_no')
promo_perc = promo_perc.sort_values("child_no")
promo_perc.plot(x="child_no", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Number of Children Distribution")
positions = (0, 1, 2, 3, 4)
labels = ["None", "1", "2", "3", "4 or more"]
plt.xticks(positions, labels, rotation = 0)
plt.show()
#get the percentage of each gender for both low and high promotion
low_perc = low_promo.GENDER.value_counts()/low_promo.GENDER.count()
low_perc = low_perc.to_frame().reset_index()
low_perc.columns = ['gender', 'no_meaningful_promotion']
high_perc = high_promo.GENDER.value_counts()/high_promo.GENDER.count()
high_perc = high_perc.to_frame().reset_index()
high_perc.columns = ['gender', 'has_meaningful_promotion']
#join low and high promotion together
promo_perc = pd.merge(left = low_perc, right = high_perc, left_on = 'gender', right_on = 'gender')
promo_perc = promo_perc.sort_values("gender")
promo_perc.plot(x="gender", y=["no_meaningful_promotion", "has_meaningful_promotion"], kind = "bar", alpha=0.8)
plt.title("High Promotion vs. Low Promotion Gender Distribution")
positions = (0, 1)
labels = ["Male", "Female"]
plt.xticks(positions, labels, rotation = 0)
plt.show()
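The four comparison blocks above repeat the same percentage computation for income, age, children, and gender. A sketch of a helper that captures the pattern once (demonstrated on made-up gender codes; `promo_profile` is a hypothetical name):

```python
import pandas as pd

def promo_profile(low: pd.DataFrame, high: pd.DataFrame, col: str) -> pd.DataFrame:
    """Share of each level of `col` among low- vs. high-promotion respondents."""
    return (
        pd.merge(
            low[col].value_counts(normalize=True).rename("no_meaningful_promotion"),
            high[col].value_counts(normalize=True).rename("has_meaningful_promotion"),
            left_index=True, right_index=True,
        )
        .rename_axis(col)
        .reset_index()
        .sort_values(col)
    )

# Made-up stand-ins for the PROMOTION == 1 and PROMOTION == 7 slices.
low = pd.DataFrame({"GENDER": [1, 1, 2, 1]})
high = pd.DataFrame({"GENDER": [1, 2, 2, 2]})
profile = promo_profile(low, high, "GENDER")
print(profile)
```

With the real slices, `promo_profile(low_promo, high_promo, "INCOME")` and friends would replace the four hand-written blocks.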
price_income = bera[['PRICE','INCOME']].dropna()
x1 = price_income.values
x1.shape
wcss_1 = []
for i in range(1, 11):
km = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
km.fit(x1)
wcss_1.append(km.inertia_)
plt.plot(np.arange(1 , 11) , wcss_1 , 'o')
plt.plot(np.arange(1 , 11) , wcss_1 , '-' , alpha = 0.5)
plt.title('The Elbow Method', fontsize = 20)
plt.xlabel('No. of Clusters')
plt.ylabel('wcss_1')
plt.show()
kmeans_1=KMeans(n_clusters=4,init='k-means++',max_iter=300,n_init=10,random_state=0)
y_kmeans_1=kmeans_1.fit_predict(x1)
labels_1 = kmeans_1.labels_
centroids_1 = kmeans_1.cluster_centers_
labels_1
centroids_1
# only labels 0-3 exist because n_clusters=4 above
plt.scatter(x1[y_kmeans_1 == 0, 0], x1[y_kmeans_1 == 0, 1], s = 100, c = 'pink', label = 'target')
plt.scatter(x1[y_kmeans_1 == 1, 0], x1[y_kmeans_1 == 1, 1], s = 100, c = 'yellow', label = 'general')
plt.scatter(x1[y_kmeans_1 == 2, 0], x1[y_kmeans_1 == 2, 1], s = 100, c = 'cyan', label = 'miser')
plt.scatter(x1[y_kmeans_1 == 3, 0], x1[y_kmeans_1 == 3, 1], s = 100, c = 'magenta', label = 'spendthrift')
plt.scatter(centroids_1[:,0], centroids_1[:, 1], s = 50, c = 'blue' , label = 'centroid')
plt.style.use('fivethirtyeight')
plt.title('Clusters: Price vs Income', fontsize = 20)
plt.xlabel('Price')
plt.ylabel('Income')
plt.legend()
plt.grid()
plt.show()
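One caution about this clustering: PRICE spans 1-7 while the INCOME codes span a wider range, so on raw values the income axis dominates the Euclidean distances K-means minimizes. A sketch of standardizing first (on synthetic price/income pairs, not the real data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic (price, income-band) pairs: price on 1-7, income codes on 1-12.
x = np.column_stack([
    rng.integers(1, 8, size=200),
    rng.integers(1, 13, size=200),
]).astype(float)

# Standardize so both features contribute equally to the distance metric.
x_scaled = StandardScaler().fit_transform(x)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(x_scaled)
print(km.cluster_centers_)
```

The same two lines (`StandardScaler().fit_transform(x1)` before fitting) could be applied to the `x1` array above; the elbow plot would then be re-run on the scaled features.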
Now let's study the 4 KPIs and 5Ps for each WB brand.
See below for the mean of each KPI for each brand.
brand_kpi_mean = bera.groupby('BRAND_LABEL')[['FAMILIARITY', 'REGARD', 'UNIQUENESS', 'MEANINGFULNESS',
                                              'PRODUCT', 'PRICE', 'PROMOTION', 'PLACEMENT', 'PEOPLE']].mean()
brand_kpi_mean
The mean scores don't differ that much across brands, ranging from about 4 to 5.
Let's visualize the KPIs and 5Ps for each brand with radar plots.
brand_lable = sorted(bera.BRAND_LABEL.unique())
fig = go.Figure()
for i in [3,4,8]:
fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
theta=brand_kpi_mean.columns,
fill='toself',
name=brand_lable[i],
showlegend=True,)
)
fig.update_layout(polar=dict(radialaxis=dict(
visible=True,
range=[3.5, 6])),
title="Average KPIs & 5Ps of Animation Brands"
)
fig.show()
Among those three animation brands, Scooby Doo has the highest Familiarity, yet people have the least willingness to pay for it.
Looney Tunes has the best-rounded score overall, with an especially high Product score. This suggests Looney Tunes-related products are very popular and meet consumers' needs well.
fig = go.Figure()
for i in [0,5,9,7,2]:
fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
theta=brand_kpi_mean.columns,
fill='toself',
name= brand_lable[i],
showlegend=True,)
)
fig.update_layout(polar=dict(radialaxis=dict(
visible=True,
range=[3.5, 6])
),
title="Average KPIs & 5Ps of Super Heroes"
)
fig.show()
Among the Justice League brands, the responses show that people are much more familiar with the individual heroes than with the Justice League overall; the reason is worth a closer look.
As expected, Batman and Superman have the highest Familiarity. However, people show low willingness to pay for them, especially Superman.
Wonder Woman stands out on all 5 marketing levers, which suggests we have done a great job marketing and promoting Wonder Woman. The data were recorded this January, right after WW84 was released, so this result makes sense.
The Flash especially lacks on Price but does well on Product. I think the T-shirts and other merchandise with the Flash logo are very popular; I remember Shlton has a Flash T-shirt he loves a lot.
fig = go.Figure()
for i in [1,6]:
fig.add_trace(go.Scatterpolar(r=brand_kpi_mean.iloc[i].values,
theta=brand_kpi_mean.columns,
fill='toself',
name="KPI-mean-%s"%brand_lable[i],
showlegend=True,)
)
fig.update_layout(polar=dict(radialaxis=dict(
visible=True,
range=[3.5, 6])
),
title="Average KPIs & 5Ps of Other Franchise Brands"
)
fig.show()
For the franchise series, Harry Potter definitely performs best, though not as well as I expected compared with the previous brands. People are familiar with it but show very low willingness to pay. It also lags on Promotion; my guess is that we haven't run a Harry Potter campaign for a while, since the film series finished many years ago and people's attention has moved elsewhere.
Comparatively, The Conjuring Series is less successful, lacking especially on Promotion and Meaningfulness. I think we can do a better job promoting this brand, as well as improving the meaningfulness of the movies.
sincerity = bera[["BRAND_LABEL", "DOWNTOEARTH", "FAMILYORIENTED", "SMALLTOWN", "HONEST", "SENTIMENTAL",
"REAL", "WHOLESOME", "ORIGINAL", "CHEERFUL", "SINCERE", "FRIENDLY"]]
excitement = bera[["BRAND_LABEL", "DARING", "TRENDY", "EXCITING", "SPIRITED", "INDEPENDENT", "YOUNG",
"IMAGINATIVE", "UNIQUE_", "UPTODATE", "COOL", "CONTEMPORARY"]]
competence = bera[["BRAND_LABEL", "RELIABLE", "HARDWORKING", "SECURE", "INTELLIGENT", "TECHNICAL",
"CORPORATE", "SUCCESSFUL", "LEADER", "CONFIDENT"]]
sophistication = bera[["BRAND_LABEL", "UPPERCLASS", "GLAMOROUS", "GOODLOOKING", "CHARMING", "FEMININE", "SMOOTH"]]
ruggedness = bera[["BRAND_LABEL", "OUTDOORSY", "MASCULINE", "TOUGH", "RUGGED"]]
sincerity_sum = sincerity.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
sincerity_sum.columns = ['attribute', 'frequency']
sincerity_sum = sincerity_sum.sort_values('frequency', ascending=False)
#sincerity_sum
excitement_sum = excitement.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
excitement_sum.columns = ['attribute', 'frequency']
excitement_sum = excitement_sum.sort_values('frequency', ascending=False)
#excitement_sum
competence_sum = competence.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
competence_sum.columns = ['attribute', 'frequency']
competence_sum = competence_sum.sort_values('frequency', ascending=False)
#competence_sum
sophistication_sum = sophistication.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
sophistication_sum.columns = ['attribute', 'frequency']
sophistication_sum = sophistication_sum.sort_values('frequency', ascending=False)
#sophistication_sum
ruggedness_sum = ruggedness.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
ruggedness_sum.columns = ['attribute', 'frequency']
ruggedness_sum = ruggedness_sum.sort_values('frequency', ascending=False)
#ruggedness_sum
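The five summary blocks above are identical except for the input frame, so the pattern can be captured in one helper. A sketch on a toy imagery frame (`attribute_frequencies` is a hypothetical name):

```python
import pandas as pd

def attribute_frequencies(df: pd.DataFrame) -> pd.DataFrame:
    """Sum each adjective column (dropping BRAND_LABEL), sorted by frequency."""
    out = df.drop('BRAND_LABEL', axis=1).sum().to_frame().reset_index()
    out.columns = ['attribute', 'frequency']
    return out.sort_values('frequency', ascending=False)

# Toy frame standing in for one imagery category (values are made up).
toy = pd.DataFrame({
    "BRAND_LABEL": ["Batman", "Superman"],
    "HONEST": [1, 0],
    "FRIENDLY": [1, 1],
})
freq = attribute_frequencies(toy)
print(freq)
```

With this helper, the five `*_sum` frames reduce to `[attribute_frequencies(f) for f in (sincerity, excitement, competence, sophistication, ruggedness)]`.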
imagery = [sincerity_sum, excitement_sum, competence_sum, sophistication_sum, ruggedness_sum]
imagery_df = pd.concat(imagery)
Let's visualize the overall image of WB's brands with a word cloud.
d = {}
for a, x in imagery_df.values:
d[a] = x
#wordcloud = WordCloud(background_color="white", mask = mask)
wordcloud = WordCloud(background_color="white")
wordcloud.generate_from_frequencies(frequencies=d)
mask = np.array(Image.open("WB.png"))
image_colors = ImageColorGenerator(mask)
plt.figure(figsize=[10,10])
plt.imshow(wordcloud.recolor(color_func=image_colors), interpolation="bilinear")
#plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Our brands stand out most with a family-oriented image. For audience targeting, this may suggest segmenting the audience by families.
Now let's study the image of each of our brands.
imagery_all = [sincerity, excitement, competence, sophistication, ruggedness]
imagery_all_df = pd.concat(imagery_all)
imagery_all_sum = imagery_all_df.groupby('BRAND_LABEL').sum()
imagery_all_sum.drop(["OUTDOORSY", "MASCULINE", "TOUGH", "RUGGED"],
inplace=True, axis=1, errors='ignore')
imagery_all_sum
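Since The Conjuring Series has notably fewer records than the other brands, raw selection counts understate its imagery. Dividing each brand's counts by its record count gives comparable selection rates; a sketch on made-up counts:

```python
import pandas as pd

# Made-up adjective counts per brand, and record counts per brand.
counts = pd.DataFrame(
    {"FRIENDLY": [800, 300], "DARING": [400, 450]},
    index=["Batman", "The Conjuring Series"],
)
records = pd.Series({"Batman": 2750, "The Conjuring Series": 1547})

# Selection rate per respondent: brands with fewer records become comparable.
rates = counts.div(records, axis=0)
print(rates.round(3))
```

On the real data, `imagery_all_sum.div(bera.BRAND_LABEL.value_counts(), axis=0)` would produce the normalized version before plotting the treemaps.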
Let's visualize the image of each brand by plotting the composition of its top 10 attributes.
for i in range(10):
value = pd.DataFrame(imagery_all_sum.iloc[i].values)
col = pd.DataFrame(imagery_all_sum.columns)
df = pd.concat([col, value], axis=1)
df.columns = ['attr', 'frequency']
top_10 = df.sort_values(by='frequency', ascending=False).head(10)
plt.figure(figsize=(8, 5))
squarify.plot(sizes=top_10.frequency, label=top_10.attr, alpha=0.6)
plt.title("Imagery of %s"%brand_lable[i], size = 15)
plt.show()
The top 10 attributes of each brand make sense and correspond to its characteristics.